Download the "Old Faithful" data set from Blackboard. It contains samples of a 2-D random variable:
the first dimension is the duration of an Old Faithful geyser eruption, and the second is the waiting time
between eruptions. Generate a 2-D scatter plot of the data. Run a k-means clustering routine on the data
for k=2. Show the two clusters in a scatter plot.
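For reference, the k-means routine requested above can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm, not what `sklearn.cluster.KMeans` does internally (scikit-learn adds k-means++ initialisation and multiple restarts). The function name `lloyd_kmeans` and the synthetic array `X_demo` are hypothetical; the blob parameters are made up and are not taken from the real data set.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so the returned labels match the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Synthetic stand-in for the geyser data: two 2-D blobs (made-up parameters).
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal([2.0, 55.0], [0.3, 4.0], size=(100, 2)),
    rng.normal([4.5, 80.0], [0.3, 4.0], size=(100, 2)),
])
labels_demo, centers_demo = lloyd_kmeans(X_demo, k=2)
```

The two steps inside the loop (assign, then re-average) are the whole algorithm; the result depends on the random initialisation, which is why library implementations usually restart several times.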
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
import seaborn as sns
import numpy as np
In [2]:
df = pd.read_table('data/old-faithful.txt')
df.head()
Out[2]:
In [3]:
df.plot.scatter('eruption', 'waiting')
Out[3]:
In [4]:
y_pred = KMeans(n_clusters=2, random_state=0).fit_predict(df)
In [5]:
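# Select the columns by name: df[[0]] looks up a column labelled 0, which does not exist here.
plt.scatter(df['eruption'], df['waiting'], c=y_pred)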
Out[5]:
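One thing worth checking with this clustering: k-means uses Euclidean distance, so a column with a much larger numeric range (here, presumably, the waiting time) dominates the assignment. A possible variant, sketched below under that assumption, standardises the columns first. The array `X_demo` is synthetic stand-in data with made-up parameters, not the real file.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a similar scale mismatch between columns (not the real data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([2.0, 55.0], [0.3, 5.0], size=(100, 2)),
    rng.normal([4.5, 80.0], [0.3, 5.0], size=(100, 2)),
])

# Rescale every column to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X_demo)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels_scaled = km.labels_
```

On the notebook's data the same idea would be `StandardScaler().fit_transform(df)` followed by `fit_predict`; whether it changes the clusters depends on how well separated the groups already are.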
In [6]:
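def mixture_model(mu1, mu2, s1, s2, alpha, n=1000):
    # Pick component 1 with probability alpha, else component 2 (a weighted sum of the draws would be one Gaussian, not a mixture).
    from_first = np.random.rand(n) < alpha
    return np.where(from_first, np.random.normal(mu1, s1, n), np.random.normal(mu2, s2, n))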
mixture_samples = mixture_model(-1,1,1,1,0.4)
plt.scatter(range(1000), mixture_samples)
Out[6]:
In [7]:
plt.hist(mixture_samples, bins=20)
Out[7]:
In [8]:
y_pred = KMeans(n_clusters=2, random_state=0).fit_predict(mixture_samples.reshape(-1,1))
In [9]:
plt.scatter(range(1000), mixture_samples, c=y_pred)
Out[9]:
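In one dimension, k-means with k=2 reduces to a single threshold: every point goes to the nearer center, so the boundary sits at the midpoint between the two centers. The sketch below checks this on a freshly drawn synthetic sample; it assumes a two-component Gaussian mixture sampled by picking a component per point with probability 0.4, and the names `samples`, `threshold`, and `agrees` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D sample: component N(-1, 1) with probability 0.4, else N(1, 1).
rng = np.random.default_rng(0)
n = 1000
from_first = rng.random(n) < 0.4
samples = np.where(from_first, rng.normal(-1, 1, n), rng.normal(1, 1, n))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples.reshape(-1, 1))
centers = km.cluster_centers_.ravel()

# The decision boundary is the midpoint between the two centers.
threshold = centers.mean()

# Nearest-center labels should coincide with thresholding at the midpoint.
labels = km.predict(samples.reshape(-1, 1))
high_label = centers.argmax()
agrees = np.array_equal(labels == high_label, samples > threshold)
```

Because the split is a hard threshold, points in the overlap of the two components are assigned by position alone, regardless of which component generated them.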